requests is a third-party library, so it has to be installed manually:
pip install requests
A simple example:
import requests
response = requests.get('http://www.python-requests.org/en/master/')
print(response.status_code)
print(response.text)
'''
200
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>..................
'''
Two important objects
The Request object
The Response object
| The 5 most-used Response attributes | Description |
| ----------------------------------- | ----------- |
| r.status_code | HTTP status code of the response; 200 means success |
| r.text | Body of the HTTP response, decoded according to r.encoding |
| r.encoding | Content encoding guessed from the charset field of the HTTP headers; defaults to ISO-8859-1 if that field is absent |
| r.apparent_encoding | Content encoding inferred from the response body itself (a fallback) |
| r.content | Binary (bytes) form of the HTTP response body |

| Common Response method | Description |
| ---------------------- | ----------- |
| r.raise_for_status() | Raises requests.HTTPError if the status code is not 200 |
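How these attributes fit together in practice (a minimal sketch; baidu.com is just a convenient test page):

```python
import requests

r = requests.get('http://www.baidu.com')
print(r.status_code)               # e.g. 200
print(r.encoding)                  # guessed from the headers, often ISO-8859-1
print(r.apparent_encoding)         # inferred from the body, e.g. utf-8
r.encoding = r.apparent_encoding   # switch to the better guess before reading r.text
print(r.text[:200])                # decoded text
print(type(r.content))             # <class 'bytes'>
```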
Six kinds of exceptions
Network connections are not always stable or reliable,
so it is best to combine try-except with raise_for_status() to handle exceptions (a sketch follows the table below).
Exception | Description
--- | ---
requests.ConnectionError | Network connection error, such as a DNS lookup failure or a refused connection
requests.HTTPError | HTTP error
requests.URLRequired | The URL is missing
requests.TooManyRedirects | The maximum number of redirects was exceeded
requests.ConnectTimeout | Timed out while connecting to the remote server
requests.Timeout | The request timed out
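Combining the two, a common access pattern looks like this (a minimal sketch; get_html_text is an illustrative name):

```python
import requests

def get_html_text(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()               # raises requests.HTTPError if status != 200
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:      # common base class of the exceptions above
        return 'request failed'

print(get_html_text('http://www.baidu.com')[:200])
```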
Seven main methods
Method | Description
--- | ---
requests.request() | Constructs a request; the base method that underlies all of the methods below
requests.get() | Corresponds to the HTTP GET method
requests.head() | Corresponds to the HTTP HEAD method
requests.post() | Corresponds to the HTTP POST method
requests.put() | Corresponds to the HTTP PUT method
requests.patch() | Corresponds to the HTTP PATCH method
requests.delete() | Corresponds to the HTTP DELETE method
The request method
The most basic and most central method; all the other methods are wrappers around it.
requests.request(method, url, **kwargs)
method: the request method, corresponding to the seven methods above (GET, HEAD, POST, PUT, etc.)
url: the network resource to access
**kwargs: 13 keyword arguments that control the request
| Keyword argument | Description |
| ---------------- | ----------- |
| params | Dict or byte sequence, appended to the URL as query parameters |
| data | Dict, byte sequence, or file object, sent as the body of the Request |
| json | JSON data, sent as the body of the Request |
| headers | Dict of custom HTTP headers |
| cookies | Dict or CookieJar, cookies to send with the Request |
| auth | Tuple, for HTTP authentication |
| files | Dict, for uploading files |
| timeout | Timeout in seconds |
| proxies | Dict of proxy servers; login credentials can be included |
| allow_redirects | True/False, default True; whether to follow redirects |
| stream | True/False, default False; if True, the response body is not downloaded until it is accessed |
| verify | True/False, default True; whether to verify the SSL certificate |
| cert | Path to a local SSL certificate |
kv = {'key1':'value1','key2':'value2'}
st = 'content to send'

# params
r = requests.request('GET','http://python123.io/ws', params = kv)
print(r.url)
'''
https://python123.io/ws?key1=value1&key2=value2
'''

# data
r = requests.request('POST','http://python123.io/ws', data = kv)
r = requests.request('POST','http://python123.io/ws', data = st)

# json
r = requests.request('POST','http://python123.io/ws', json = kv)

# headers
hd = {'user-agent':'Chrome/10'}
r = requests.request('POST', 'http://python123.io/ws', headers = hd)

# files
fs = {
'file':open('data.xls', 'rb')
}
r = requests.request(
'POST',
'http://python123.io/ws',
files=fs
)

# timeout
r = requests.request(
'GET',
'http://www.baidu.com',
timeout = 10
)

# proxies
pxs = {
'http':'http://user:pass@10.10.10.1:1234', 'https':'https://10.10.10.1:4321'
}
r = requests.request(
'GET',
'http://www.baidu.com',
proxies=pxs
)
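A few of the switches not demonstrated above (auth, allow_redirects, verify), in one hedged sketch using httpbin's Basic-Auth test endpoint with placeholder credentials:

```python
# auth / allow_redirects / verify
r = requests.request(
    'GET',
    'https://httpbin.org/basic-auth/user/passwd',   # test endpoint; credentials are placeholders
    auth=('user', 'passwd'),     # (username, password) tuple for HTTP Basic auth
    allow_redirects=True,        # follow redirects (the default)
    verify=True,                 # verify the SSL certificate (the default)
    timeout=10
)
print(r.status_code)             # 200 when the credentials are accepted
```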
The get method
The most commonly used method.
r = requests.get(url, params=None, **kwargs)
url is the link of the page to fetch
params: extra parameters appended to the url, as a dict or byte sequence
**kwargs: the 12 remaining keyword arguments controlling the request, same as above
requests.get is a wrapper around the requests.request method
r is a Response object containing all the information returned by the server.
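For example, the params call from earlier written with get (equivalent to the requests.request('GET', ...) form):

```python
import requests

kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('http://python123.io/ws', params=kv)
print(r.url)            # https://python123.io/ws?key1=value1&key2=value2
print(r.status_code)
```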
The head method
requests.head(url, **kwargs)
r = requests.head('http://www.python-requests.org/en/master/')
print(r.headers)
print(r.text)
'''
{'Content-Length': '0', 'Content-Type': 'text/html', 'Content-Encoding': 'gzip', 'Last-Modified': 'Thu, 13 Dec 2018 21:34:51 GMT', 'ETag': 'W/"5c12d07b-c7e2"', 'Vary': 'Accept-Encoding', 'Server': 'nginx', 'X-Cname-TryFiles': 'True', 'X-Served': 'Nginx', 'X-Deity': 'web01', 'Date': 'Fri, 22 Feb 2019 10:23:20 GMT'}
''
'''

The post method
requests.post(url, data=None, json=None, **kwargs)
# POST a dict to the URL; it is automatically encoded as a form
payload = {'key1':'value1', 'key2':'value2'}
r = requests.post('http://httpbin.org/post', data = payload)
print(r.text)
'''
{
"args": {},
"data": "",
"files": {},
"form": {
"key1": "value1",
"key2": "value2"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "23",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.21.0"
},
"json": null,
"origin": "171.125.82.158, 171.125.82.158",
"url": "https://httpbin.org/post"
}
'''

# POST a string to the URL; it is automatically treated as data
r = requests.post('http://httpbin.org/post', data = 'ABC')
print(r.text)
'''
'{\n "args": {}, \n "data": "ABC", \n "files": {}, \n "form": {}, \n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate", \n "Content-Length": "3", \n "Host": "httpbin.org", \n "User-Agent": "python-requests/2.21.0"\n }, \n "json": null, \n "origin": "171.125.82.158, 171.125.82.158", \n "url": "https://httpbin.org/post"\n}\n'
'''

# POST: submit a form
data = {'first name':'xxxx', 'last name':'hhh'}
r = requests.post(url, data=data)    # url here is a placeholder for the target form
print(r.text)

# POST: upload an image
file = {'uploadFile':open('./image.jpg','rb')}
r = requests.post(url,files=file)
print(r.text)

# using cookies
payload = {'first name':'xxxx', 'last name':'hhh'}
r = requests.post(
'http://pythonscraping.com/pages/cookies/welcome.php',
data=payload
)
print(r.cookies.get_dict())
r = requests.get(
'http://pythonscraping.com/pages/cookies/profile.php',
cookies=r.cookies)
print(r.text)

# using a session
session = requests.Session()
payload = {'first name':'xxxx', 'last name':'hhh'}
r = session.post(
'http://pythonscraping.com/pages/cookies/welcome.php',
data=payload
)
print(r.cookies.get_dict())
r = session.get(
'http://pythonscraping.com/pages/cookies/profile.php'
)
print(r.text)

The put method
requests.put(url, data=None, **kwargs)
payload = {'key1':'value1', 'key2':'value2'}
r = requests.put('http://httpbin.org/put', data = payload)
print(r.text)
'''
{
"args": {},
"data": "",
"files": {},
"form": {
"key1": "value1",
"key2": "value2"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "23",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.21.0"
},
"json": null,
"origin": "171.125.82.158, 171.125.82.158",
"url": "https://httpbin.org/put"
}
'''

The patch method
requests.patch(url, data=None, **kwargs)
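A minimal sketch against httpbin (the field name and value are arbitrary):

```python
import requests

# PATCH sends a partial update; httpbin echoes back what it received
r = requests.patch('http://httpbin.org/patch', data={'key1': 'new value'})
print(r.status_code)
print(r.json()['form'])    # {'key1': 'new value'}
```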
The delete method
requests.delete(url, **kwargs)
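A minimal sketch against httpbin's echo endpoint:

```python
import requests

r = requests.delete('http://httpbin.org/delete')
print(r.status_code)       # 200 if the request went through
```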
Examples
Crawling a JD product page
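A minimal sketch, assuming an illustrative product URL and the try-except framework from above:

```python
import requests

url = 'https://item.jd.com/2967929.html'     # illustrative product URL
try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])                     # first 1000 characters of the page
except requests.RequestException:
    print('crawl failed')
```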
Crawling an Amazon product page
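Amazon may reject the default python-requests User-Agent, so the point of this example is to add a browser-like UA header (a sketch; the URL and UA string are illustrative):

```python
import requests

url = 'https://www.amazon.cn/gp/product/B01M8L5Z3Y'   # illustrative product URL
headers = {'user-agent': 'Mozilla/5.0'}               # pretend to be a browser
try:
    r = requests.get(url, headers=headers, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except requests.RequestException:
    print('crawl failed')
```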
Submitting a search keyword to Baidu and 360
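Both engines take the keyword as a query parameter passed through params: Baidu uses wd, 360 uses q (a sketch):

```python
import requests

keyword = 'Python'
try:
    # Baidu: http://www.baidu.com/s?wd=<keyword>
    r = requests.get('http://www.baidu.com/s', params={'wd': keyword}, timeout=30)
    r.raise_for_status()
    print(r.request.url)     # the URL that was actually requested
    print(len(r.text))

    # 360: http://www.so.com/s?q=<keyword>
    r = requests.get('http://www.so.com/s', params={'q': keyword}, timeout=30)
    r.raise_for_status()
    print(r.request.url)
    print(len(r.text))
except requests.RequestException:
    print('crawl failed')
```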
Crawling and saving an image from the web
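The image is fetched as bytes through r.content and written to a local file (a sketch; the image URL and save directory are placeholders):

```python
import os
import requests

url = 'http://image.example.com/sample.jpg'   # placeholder image URL
root = './images/'
path = root + url.split('/')[-1]              # keep the original file name
try:
    os.makedirs(root, exist_ok=True)
    if not os.path.exists(path):
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        with open(path, 'wb') as f:
            f.write(r.content)                # binary body of the response
        print('saved to', path)
    else:
        print('file already exists')
except requests.RequestException:
    print('crawl failed')
```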
IP address geolocation lookup
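The query works by appending the IP address to the lookup site's query URL (a sketch; m.ip138.com is the interface this example is usually written against, and it may have changed):

```python
import requests

url = 'http://m.ip138.com/ip.asp?ip='   # lookup interface; may no longer be available
ip = '202.204.80.112'                   # illustrative IP address
try:
    r = requests.get(url + ip, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])                # the result appears near the end of the page
except requests.RequestException:
    print('lookup failed')
```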